#packs for data processing
library(rjson) # to read .JSON files.
library(tidyr) # to process data
library(dplyr) # to process data
library(purrr) # to process data
library(lubridate) # to deal with date variables
#packs for data viz
library(sp) # a pack for spatial objects
library(leaflet) # map and its functions
Introduction to the series: “Visualisation of My Personal Google Data”
If you give Google permission to store your location data, they will keep it in their databases forever. You can also allow them to store it for a while and then ask them to delete it, which they will do right away.
What makes this study a fun project is that it is very personal. I decided to analyze my personal data in August 2022, so I granted Google many new permissions on top of the previously granted ones. Google keeps the data in various formats, including .csv, .json and .mbox. When you request your personal data, they provide it within a couple of days, depending on the size of the data you queried.
I usually provide readers with the data used in my posts. However, in this series the data are very personal, so I will not.
Introduction: “My Locations”
In this part of the series, we will investigate my personal location data by visualizing the spots I visited within a given period. This way, I will personally gain some insight into how boring my days are :)
The R packages that we use in this post are as follows: rjson, tidyr, dplyr, purrr, lubridate, sp and leaflet.
Understand the Data
Inside the Takeout folder that I received from Google, there is a folder named “Location History”. Inside it, “Semantic Location History” contains the location data organized by month and year. From that folder, I picked the locations I visited in November; thus, we will use the 2022_NOVEMBER.json file. Let’s investigate the data, starting with reading the file into the R environment.
my_locations <- fromJSON(file = "2022_NOVEMBER.json")
Then, let’s try to understand the structure of the data, how and what kind of information is stored into its cells. The list object my_locations
contains many lists inside it. Let’s try to understand each one of them one by one:
summary(my_locations[[1]])
Length Class Mode
[1,] 1 -none- list
[2,] 1 -none- list
[3,] 1 -none- list
[4,] 1 -none- list
[5,] 1 -none- list
[6,] 1 -none- list
[7,] 1 -none- list
[8,] 1 -none- list
[9,] 1 -none- list
[10,] 1 -none- list
[11,] 1 -none- list
[12,] 1 -none- list
[13,] 1 -none- list
[14,] 1 -none- list
[15,] 1 -none- list
[16,] 1 -none- list
[17,] 1 -none- list
There are many smaller lists inside this first list. Let’s take the first one and see what’s inside:
summary(my_locations[[1]][[1]])
Length Class Mode
placeVisit 11 -none- list
There is only a single list inside. Sad :( Let’s dive one level deeper:
summary(my_locations[[1]][[1]][[1]])
Length Class Mode
location 8 -none- list
duration 2 -none- list
placeConfidence 1 -none- character
centerLatE7 1 -none- numeric
centerLngE7 1 -none- numeric
visitConfidence 1 -none- numeric
otherCandidateLocations 4 -none- list
editConfirmationStatus 1 -none- character
locationConfidence 1 -none- numeric
placeVisitType 1 -none- character
placeVisitImportance 1 -none- character
Finally, here we have several items. There is a list called location containing 8 items, duration with 2 items, and otherCandidateLocations with 4 items. The other elements hold a single item each. Let’s check these one by one:
summary(my_locations[[1]][[1]][[1]]$location)
Length Class Mode
latitudeE7 1 -none- numeric
longitudeE7 1 -none- numeric
placeId 1 -none- character
address 1 -none- character
semanticType 1 -none- character
sourceInfo 1 -none- list
locationConfidence 1 -none- numeric
calibratedProbability 1 -none- numeric
summary(my_locations[[1]][[1]][[1]]$duration)
Length Class Mode
startTimestamp 1 -none- character
endTimestamp 1 -none- character
summary(my_locations[[1]][[1]][[1]]$otherCandidateLocations)
Length Class Mode
[1,] 7 -none- list
[2,] 7 -none- list
[3,] 7 -none- list
[4,] 7 -none- list
We can obtain a lot of information through this investigation process. For instance, inside location I can see the latitude, longitude, address, the confidence assigned to the place, and more. If you are following along with me, please spare some time to understand your own data: delve into it and digest as much information as you can. I will see you in the next section: pre-processing.
Pre-processing
You can use as many items as you want in your own work; decide which pieces of information are meaningful while exploring your data. Now let’s re-define our nested lists as a dataframe.
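A minimal sketch of that conversion, using purrr and dplyr; I’ll call the result my_locations_df (a name of my choosing), and your Takeout file may be shaped slightly differently. The type.convert() call turns the all-character columns back into numbers where possible:
#flatten every timeline object into one wide row, then stack the rows:
my_locations_df <- my_locations[[1]] %>%
  map(unlist) %>%
  map(~ as.data.frame(t(.x), stringsAsFactors = FALSE)) %>%
  bind_rows() %>%
  type.convert(as.is = TRUE)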
There are many columns, some of which I won’t need. In particular, I am not interested in the locations labelled as candidates, so I will exclude them from this study. They are probably the alternative places I might have visited, ordered by likelihood. I just need the one with the highest likelihood, which is stored under placeVisit.location. These locations are also tagged as “HIGH CONFIDENCE”. Let’s continue the analysis with these locations only.
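A sketch of that filter, assuming the flattened column is named placeVisit.placeConfidence and the tag is spelled HIGH_CONFIDENCE, as it typically is in Takeout exports:
#keep only the visits that Google marked as high confidence:
my_locations_df <- filter(my_locations_df, placeVisit.placeConfidence == "HIGH_CONFIDENCE")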
Also, there are some columns with no entries at all. Let me exclude them with a function called not_all_na, which drops every column that is completely empty.
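A minimal version of that helper could look like this (my_locations_df is carried over from the steps above):
#TRUE for columns holding at least one non-NA value:
not_all_na <- function(x) any(!is.na(x))
#keep only those columns:
my_locations_df <- select(my_locations_df, where(not_all_na))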
Now, I have a dataframe with 150+ columns. However, I just need the latitude, longitude, date and address of the locations I visited. Let’s write a query to get this data into new dataframes:
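A sketch of such a query with dplyr’s contains() helper, which matches column names by substring. It again assumes my_locations_df from above; the object names lat, lon, address and date are the ones used in the rest of the post:
lat     <- select(my_locations_df, contains("placeVisit.location.latitudeE7"))
lon     <- select(my_locations_df, contains("placeVisit.location.longitudeE7"))
address <- select(my_locations_df, contains("placeVisit.location.address"))
date    <- select(my_locations_df, contains("placeVisit.duration.startTimestamp"))
colnames(date) <- "Date" #a friendlier column name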
The chunks of code above ask for the columns whose names contain the strings written in quotation marks. Still, this raw information isn’t enough, for a couple of reasons. First, lat and lon hold coordinates in E7 format; a quick search on the internet shows that they simply need to be divided by 10000000 (10^7). Second, date packs day, month, year, hour, minute, second and time-zone information (the timestamps are in GMT+0) into a single column, and this needs to be handled. Let’s start with the second issue (the one about date):
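First, a peek at the raw timestamps in the date dataframe built above:
head(date)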
Date
1 2022-11-06T13:12:38.091Z
2 2022-11-06T13:24:32.338Z
3 2022-11-06T13:59:00.539Z
4 2022-11-06T14:17:15.462Z
5 2022-11-06T14:39:29.138Z
6 2022-11-07T05:32:14.277Z
As can be seen above, there are two separators: “T” separates the day from the time, and “.” separates the time from the milliseconds-plus-“Z” (UTC) suffix. Follow the notes in the code to grasp the process:
#split the day-and-time info from the fractional-seconds/UTC suffix, then drop the suffix:
date <-
separate(
data = date,
col = Date,
into = c("Date", "zone"),
sep = "\\."
)
date <- date[-c(2)] #drop the "zone" column
#Now, parse the timestamps as UTC and shift them to my local time zone, GMT+3:
date$Date <- as.POSIXct(date$Date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC") + hours(3)
#divide the day and hour info:
date <-
separate(
data = date,
col = Date,
into = c("Day", "Hour"),
sep = " "
)
#see the new format:
head(date)
Day Hour
1 2022-11-06 16:12:38
2 2022-11-06 16:24:32
3 2022-11-06 16:59:00
4 2022-11-06 17:17:15
5 2022-11-06 17:39:29
6 2022-11-07 08:32:14
Nicely done! Now let’s gather all the information we need into a single dataframe. Again, follow along with the notes in the code:
coords <-
drop_na(data.frame(
lat = unlist(lat, use.names = FALSE) / 10000000, #divide lat and lon by 10000000 to get rid of the E7 format
lon = unlist(lon, use.names = FALSE) / 10000000,
address = unlist(address, use.names = FALSE),
date # we processed this before
))
So far, we have been preparing for the data visualization. Our data is ready under the name coords. Let’s continue with the visualization.
Data Visualization
At this point, we will visualize the locations I visited in November 2022 on a world map. You can’t be as disappointed as I am to see that I live a life between home and work. Still, the point here is the visualization process itself. We owe this beautiful map to the R package leaflet. Leaflet is actually a JavaScript library, but all of its functionality is exposed in the R environment, so we can work with it from R. If you are still with me, I highly recommend reading the documentation of the leaflet package. Then follow along with the notes in the code, and try to understand it if you are not familiar with it.
#turn the dataframe into a SpatialPointsDataFrame (from the sp package):
coordinates(coords) <- ~ lon + lat
leaflet(coords,
# formatting the outer frame of the map:
width = "800px",
height = "400px",
padding = 10) %>%
addTiles() %>%
#formatting the markers on the map:
addCircleMarkers(
color = "tomato", #my favorite colour
fillOpacity = 1,
radius = 7,
stroke = FALSE,
#address pops up when you click on a marker:
popup = coords$address,
#the date and hour show up with a fancy personal note when you hover over a marker:
label = paste0("I have been around here on ", coords$Day, " at around ", coords$Hour),
#formatting the label that shows up when you hover:
labelOptions = labelOptions(
noHide = FALSE,
direction = "top",
style = list(
"color" = "black",
"font-family" = "calibri", #I love calibri
"box-shadow" = "3px 3px rgba(0,0,0,0.25)",
"font-size" = "12px",
"border-color" = "rgba(0,0,0,0.5)"
)
)
)
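If you want to keep the result, the htmlwidgets package (not loaded at the top of this post) can save it as a standalone HTML file. This sketch assumes you first assigned the whole leaflet pipeline above to an object, say my_map:
library(htmlwidgets)
#save the interactive map so it can be opened in any browser:
saveWidget(my_map, "my_locations_map.html")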
Conclusion
Working with your personal data gives you the opportunity to understand your own habits, likes, dislikes, and maybe your expectations for the future. Here, you can only see my locations from November. When I worked on longer periods, I realized that I need to travel and see new places more often. Even within my own city, a new place is a new vision of life.
Visualizing data in a spatial setting is a new challenge for me. Compared with ordinary graphs and charts, working with maps is obviously more engaging. For visualizing location data on maps, leaflet is an amazing open-source library. There are other options, though, and one deserves a mention here: ggmap. However, to use that package you need an API key obtained from Google. For more information about API keys, visit here. As for the package itself, you can visit the CRAN page of ggmap; under the title “Google Maps API key” you will find the procedure for obtaining a personal API key. It reads as follows:
GOOGLE MAPS API KEY [@ggmap]
A few years ago Google changed its API requirements, and ggmap users are now required to register with Google. From a user’s perspective, there are essentially three ramifications of this:

1. Users must register with Google. You can do this at https://mapsplatform.google.com. While it will require a valid credit card (sorry!), there seems to be a fair bit of free use before you incur charges, and even then the charges are modest for light use.

2. Users must enable the APIs they intend to use. What may appear to ggmap users as one overarching “Google Maps” product is in fact several services that Google provides as geo-related solutions. For example, the Maps Static API provides map images, while the Geocoding API provides geocoding and reverse geocoding services. Apart from the relevant Terms of Service, generally ggmap users don’t need to think about the different services. For example, you just need to remember that get_googlemap() gets maps, geocode() geocodes (with Google, DSK is done), etc., and ggmap handles the queries for you. However, you do need to enable the APIs before you use them. You’ll only need to do that once, and then they’ll be ready for you to use. Enabling the APIs just means clicking a few radio buttons on the Google Maps Platform web interface listed above, so it’s easy.

3. Inside R, after loading the new version of ggmap, you’ll need to provide ggmap with your API key, a hash value (think: a string of gibberish) that authenticates you to Google’s servers. This can be done on a temporary basis with register_google(key = "[your key]") or permanently using register_google(key = "[your key]", write = TRUE) (note: this will overwrite your ~/.Renviron file by replacing/adding the relevant line). If you use the former, know that you’ll need to re-do it every time you reset R.

Your API key is private and unique to you, so be careful not to share it online, for example in a GitHub issue or by saving it in a shared R script file. If you share it inadvertently, just get on Google’s website and regenerate your key; this will retire the old one. Keeping your key private is made a bit easier by ggmap scrubbing the key out of queries by default, so when URLs are shown in your console, they’ll look something like key=xxx. (Read the details section of the register_google() documentation for a bit more info on this point.)
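For completeness, here is a minimal sketch of what the ggmap workflow looks like once a key is registered; the center coordinates below are placeholder values, not taken from my data:
library(ggmap)
register_google(key = "[your key]") #temporary: lasts only for this R session
#fetch a static map around a placeholder point and plot it:
m <- get_googlemap(center = c(lon = 29.0, lat = 41.0), zoom = 11)
ggmap(m)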
Stay tuned!
This series continues with the visualization of my Google Fit data, where we will delve into my exercise habits.